White Wine Quality Data Analysis Project by Yang Liu

Abstract

In this project, I explore and analyze the white wine quality dataset. The dataset contains 4898 white wines with 11 variables that quantify different chemical attributes. It also includes an output variable, the quality rating of each wine, on a scale from 0 to 10. I analyze the relationships between the wine attributes and the ratings, and I explore whether there are any strong relationships among the attributes themselves.

Dataset

In this section, I load the data. The variable names are shown below.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Now let’s see the structure of the variables:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
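
The loading step can be sketched roughly as follows. The report does not give the file name, so this example is only an assumption: it writes the first five wines from the structure output above to a temporary CSV (a stand-in for the real file) and reads it back with `read.csv`.

```r
# Stand-in for the real CSV: the first five wines from the str() output above.
# The actual file name is not given in this report, so a temporary file is used.
sample_rows <- data.frame(
  X = 1:5,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# Load the data, then list the variable names and inspect the structure.
wines <- read.csv(csv_path)
names(wines)
str(wines)
```

With the real file, only the path passed to `read.csv` would change.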

There is an X variable, which is just the index of each wine. Since there is no missing data in this dataset, I simply show the summary of each variable below.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
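
A minimal sketch of the two checks described above: confirming that no values are missing and leaving out the index column X before summarizing. The small data frame is illustrative only (a few columns from the first rows of the structure output), not the full dataset.

```r
# Illustrative data frame with the same layout as the report's data.
wines <- data.frame(
  X = 1:5,
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Count missing values per column; all zeros means nothing to impute or drop.
missing_per_column <- colSums(is.na(wines))
print(missing_per_column)

# Leave out the index column X, since it carries no information about the wine.
wines_no_index <- wines[, names(wines) != "X"]
summary(wines_no_index)
```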

Univariate Plots Section

In this section, I plot several histograms to explore how the wines are distributed across different variables.

First let’s take a look at the ratings of the wines.

The ratings are roughly normally distributed and centered at 6. Most wines received a rating of 5 or 6.
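
Because the ratings are whole numbers, the counts behind this plot can also be tabulated directly. The sketch below uses a short made-up vector of ratings purely for illustration; with the real data, the input would be the quality column.

```r
# Hypothetical ratings for illustration only; not taken from the dataset.
quality <- c(5, 6, 6, 7, 5, 6, 6, 4, 7, 6, 5, 8)

# Tabulate over the full observed range so that empty ratings still appear.
rating_levels <- min(quality):max(quality)
rating_counts <- table(factor(quality, levels = rating_levels))
print(rating_counts)

# Bar chart of the counts (drawn on a null device so no file is written here).
pdf(NULL)
barplot(rating_counts, xlab = "Quality rating", ylab = "Number of wines")
invisible(dev.off())

# The most frequent rating in this made-up sample.
most_common_rating <- names(rating_counts)[which.max(rating_counts)]
most_common_rating
```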

Next is alcohol. The number of wines decreases as the alcohol percentage increases. Wines with about 9% alcohol are the most common, and the distribution is right-skewed, with a long tail toward higher alcohol percentages.
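
The direction of skew can be checked numerically as well as by eye. In the summary above, alcohol has a mean (10.51) above its median (10.40), which is the pattern expected when the longer tail is on the high side. The sketch below defines a small helper for sample skewness, since base R does not ship one; the alcohol values are made up for illustration.

```r
# Sample skewness: positive values mean a longer right tail, negative values
# a longer left tail. This helper is an illustration, not part of base R.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Hypothetical alcohol percentages, for illustration only.
alcohol <- c(8.8, 9.0, 9.2, 9.4, 9.5, 9.5, 9.8, 10.1, 10.4, 11.0, 12.2, 13.5)

sample_skewness(alcohol)
mean(alcohol) > median(alcohol)  # a mean above the median also hints at a right tail
```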

Next is fixed acidity. Most wines have a fixed acidity between 6 and 8 g/dm^3.

The above histogram shows total sulfur dioxide. Most wines have total sulfur dioxide between 100 and 200 mg/dm^3.

This histogram shows the counts of wines at different pH values. Most wines have a pH between 3.0 and 3.3.

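This histogram shows the counts of wines by residual sugar. The most common values are under 2.5 g/dm^3, but the distribution has a long right tail: the summary above gives a median of 5.2 g/dm^3 and a maximum of 65.8 g/dm^3.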

Last, let’s plot the histograms for every variable in a single figure.
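
One way to put every histogram in a single figure is a grid of panels. The sketch below uses base graphics rather than ggplot, and the helper name `plot_all_histograms` and the small data frame are assumptions made for illustration.

```r
# Draw one histogram per numeric column on a shared grid of panels.
# Returns the names of the plotted columns (invisibly).
plot_all_histograms <- function(df, n_cols = 4) {
  numeric_names <- names(df)[vapply(df, is.numeric, logical(1))]
  n_rows <- ceiling(length(numeric_names) / n_cols)
  old_par <- par(mfrow = c(n_rows, n_cols), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))
  for (column in numeric_names) {
    hist(df[[column]], main = column, xlab = column, ylab = "Count")
  }
  invisible(numeric_names)
}

# Illustrative values only; the real data frame has 4898 rows.
wines <- data.frame(
  pH = c(3.00, 3.30, 3.26, 3.19, 3.22, 3.18),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 9.6),
  quality = c(5, 6, 6, 7, 6, 5)
)

pdf(NULL)  # null device so that no file is written in this sketch
plotted <- plot_all_histograms(wines)
invisible(dev.off())
plotted
```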

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations and 13 variables in this dataset. Among the variables, X is the index of each wine and quality is its rating; both are stored as integers (int). Quality is treated as the output variable that depends on the other 11 variables, which are chemical properties of the wines stored as numeric (num) values.

What is/are the main feature(s) of interest in your dataset?

In this dataset, I’m interested in the relationships between pH, alcohol and quality. I would like to explore whether there is any strong relationship between them.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Density, volatile acidity and free sulfur dioxide may also support my investigation.

Did you create any new variables from existing variables in the dataset?

I haven’t created any new variables so far, since I’m not familiar with all the chemicals. For many of them, it is unclear what counts as a high or low value.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Some variables are skewed to the right and some are roughly normally distributed. There are no particularly unusual distributions in the dataset.

Bivariate Plots Section

In this part, let’s look at some bivariate plots and follow up on the variables I’m interested in. First let’s take a look at a box plot grouped by wine quality.

For most wines with a quality rating above 6, the alcohol percentage is above 10%.
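
A box plot of this kind, together with the per-rating medians it displays, can be sketched as follows. The values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical values for illustration only.
wines <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.0, 10.4, 11.1, 11.0, 11.8, 12.4)
)

# Median alcohol within each quality rating (the centre line of each box).
median_alcohol <- tapply(wines$alcohol, wines$quality, median)
print(median_alcohol)

# One box per quality rating (null device so no file is written here).
pdf(NULL)
boxplot(alcohol ~ quality, data = wines,
        xlab = "Quality rating", ylab = "Alcohol (% by volume)")
invisible(dev.off())
```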

In the plot above, pH is roughly normally distributed within each quality rating, and there is no strong relationship between pH and quality.

The above graph is the scatter plot of pH vs. alcohol. It does not show any strong relationship between pH and alcohol.

The above graph is the scatter plot of residual.sugar vs. pH. It does not show any strong relationship between residual.sugar and pH.

Above is the scatter plot of volatile.acidity vs. pH. I assumed that volatile acidity would affect pH, but the scatter plot does not show a strong relationship between them.

The above plot is total.sulfur.dioxide vs. density. The density of the wine tends to increase with total sulfur dioxide.

Next is alcohol vs. density. As alcohol increases, the density of the wine drops.

The plot of pH vs. density shows no strong relationship between pH and density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the investigation above, there is no strong relationship for pH vs. alcohol, residual.sugar vs. pH, volatile.acidity vs. pH, or pH vs. density. There is a clear negative linear relationship for alcohol vs. density and a weaker positive one for total.sulfur.dioxide vs. density.
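
The strength of each pairwise relationship can be summarized with a Pearson correlation coefficient. The sketch below is only an illustration: the helper `pair_correlations` is a hypothetical name, and the data are made up, so the printed values say nothing about the real dataset.

```r
# Pearson correlation for each named pair of columns.
pair_correlations <- function(df, pairs) {
  vapply(pairs, function(p) cor(df[[p[1]]], df[[p[2]]]), numeric(1))
}

# Hypothetical values for illustration only.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 10.8, 11.4, 12.0, 12.6, 13.1),
  density = c(1.0010, 0.9990, 0.9975, 0.9960, 0.9942, 0.9930, 0.9915, 0.9900),
  pH      = c(3.20, 3.05, 3.30, 3.10, 3.25, 3.00, 3.35, 3.15)
)

pairs_to_check <- list(
  "alcohol vs density" = c("alcohol", "density"),
  "pH vs alcohol"      = c("pH", "alcohol"),
  "pH vs density"      = c("pH", "density")
)

correlations <- pair_correlations(wines, pairs_to_check)
round(correlations, 2)
```

Values near -1 or 1 indicate a strong linear relationship, and values near 0 indicate little or none.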

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes. At the beginning I didn’t have much interest in density, but both alcohol and total.sulfur.dioxide turn out to be related to it.

What was the strongest relationship you found?

The strongest relationship is between alcohol and density. The more alcohol a wine contains, the lower its density.

Multivariate Plots Section

In this section, I have plotted several scatter plots with quality as a factor.
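
A scatter plot with quality as a factor assigns one color to each rating. The sketch below shows the idea in base graphics; the plots in this report may have been produced differently (for example with ggplot), and the data here are made up for illustration.

```r
# Hypothetical values for illustration only.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 10.8, 11.4, 12.0, 12.6, 13.1),
  density = c(1.0010, 0.9990, 0.9975, 0.9960, 0.9942, 0.9930, 0.9915, 0.9900),
  quality = c(5, 5, 6, 5, 6, 7, 6, 7)
)

# Treat the rating as a categorical variable and give each level one color.
quality_factor <- factor(wines$quality)
palette_colors <- hcl.colors(nlevels(quality_factor))
point_colors <- palette_colors[as.integer(quality_factor)]

pdf(NULL)  # null device so that no file is written in this sketch
plot(wines$alcohol, wines$density, col = point_colors, pch = 19,
     xlab = "Alcohol (% by volume)", ylab = "Density (g/cm^3)")
legend("topright", legend = levels(quality_factor),
       col = palette_colors, pch = 19, title = "Quality")
invisible(dev.off())
```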

The first graph above shows the relationship between pH and alcohol. It is hard to see any pattern related to the quality of the wines.

The second graph above is the scatter plot of volatile.acidity vs. pH. Most wines with a quality rating above 5 have volatile acidity around or under 0.25 g/dm^3, and they are spread across different pH values.

The above graph is the scatter plot of alcohol vs. density. Wines rated higher than 5 are concentrated where alcohol is above 11% and density is below 0.996 g/cm^3.
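
The region described here (alcohol above 11% and density below 0.996 g/cm^3) can also be compared with the rest of the data numerically. This is a rough sketch on made-up values, so the resulting averages are illustrative only.

```r
# Hypothetical values for illustration only.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 10.8, 11.4, 12.0, 12.6, 13.1),
  density = c(1.0010, 0.9990, 0.9975, 0.9960, 0.9942, 0.9930, 0.9915, 0.9900),
  quality = c(5, 5, 6, 5, 6, 7, 6, 7)
)

# Flag wines in the high-alcohol, low-density region named in the text.
in_region <- wines$alcohol > 11 & wines$density < 0.996

# Average rating inside (TRUE) and outside (FALSE) that region.
mean_quality <- tapply(wines$quality, in_region, mean)
print(mean_quality)
```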

The above graph is the scatter plot of pH vs. density. Most wines rated higher than 5 are spread across different pH values, with density under 0.996 g/cm^3.

The last graph here is the scatter plot of total sulfur dioxide vs. density. Most wines rated higher than 5 have total sulfur dioxide under 200 mg/dm^3. There is also a weak linear relationship between total sulfur dioxide and density: as total sulfur dioxide increases, density increases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

From the analysis above, density turns out to be an important property for wine ratings. A higher alcohol percentage is also associated with a higher rating. In contrast, at the beginning I was interested in the effect of pH on wine quality, but it turns out not to matter much.

Were there any interesting or surprising interactions between features?

Density has a strong effect on wine quality, which is very surprising to me.


Final Plots and Summary

Plot One

Description One

From this plot, wines rated 6 are the most common, and the ratings roughly follow a normal distribution.

Plot Two

Description Two

This plot shows a strong negative linear relationship between the alcohol and density of wines. As the alcohol percentage rises, the density drops.
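
A linear relationship like this one can be quantified with a simple linear regression. The sketch below fits density against alcohol on made-up values; the fitted coefficients are illustrative and are not results from the real dataset.

```r
# Hypothetical values for illustration only.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 10.8, 11.4, 12.0, 12.6, 13.1),
  density = c(1.0010, 0.9990, 0.9975, 0.9960, 0.9942, 0.9930, 0.9915, 0.9900)
)

# Fit density as a linear function of alcohol.
fit <- lm(density ~ alcohol, data = wines)

# A negative slope means density falls as alcohol rises.
alcohol_slope <- coef(fit)[["alcohol"]]
alcohol_slope

# Share of the variance in density explained by alcohol.
r_squared <- summary(fit)$r.squared
r_squared
```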

Plot Three

Description Three

This plot shows that wines with a higher alcohol percentage and lower density tend to have higher ratings.


Reflection

From the last two graphs above, wines with a higher alcohol percentage tend to have higher ratings. This is probably related to the fermentation step of winemaking, in which yeast produces the alcohol. The hardest part of this project was that I’m still not very familiar with ggplot and R. I have some better ideas but don’t yet know how to implement them, which leaves me plenty of room to keep studying. What surprised me was the dataset provided by the instructors. This project was a lot of fun and I really enjoyed it. Once I’m more familiar with R programming, I will come back and explore this dataset further.